In [84]:
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
import matplotlib.pyplot as plt
import matplotlib.lines as lines
import seaborn as sns
import datetime
import folium
import json
from branca.colormap import linear

%matplotlib notebook

Part 1 - Understanding the research

The goal of this data analysis research is to better understand the problem of sexism in Brazil. It is well known that women face many more obstacles than men in several areas of our society, but seeing those differences in charts makes it easier to convey how large they are.

A quick Google search turns up several articles about the disparity between women and men in salaries, top positions at work, percentage of doctorates and much more. Brazil in particular suffers from a lack of open data and of the infrastructure needed to understand the problems of our society. As a consequence, finding data about the differences between women and men, such as opportunities and investments, is a challenge.

Part 2 - Analyzing the disparity in financial credit between women and men

Since we live in a capitalist society, income is a major source of opportunity regardless of the area we are talking about. In other words, analysing the financial differences between women and men points us toward a better analytical approach for figuring out why men have many more opportunities than women.

Difference between women's and men's small-company credit received from the federal government

In Brazil, we have several types of companies. Those types are defined by the amount of income they receive per year. The data below is released every quarter by the federal government of Brazil and shows the amount of credit it has granted to small entrepreneurs.

The data can be found in the following links:

In [34]:
df_mei_fem = pd.read_csv('Data/saldo_credito_mei_feminino.csv', sep=';',encoding='latin1')
df_mei_fem['data'] = pd.to_datetime(pd.Series(df_mei_fem['data']), format="%d/%m/%Y")
df_mei_fem['data_quarter'] = df_mei_fem['data']
df_mei_fem['data_quarter'] = df_mei_fem['data_quarter'].dt.to_period("Q")
df_mei_fem = df_mei_fem.set_index('data')
df_mei_fem['valor'] = df_mei_fem['valor'].apply(lambda x: x.replace(',','.'))
df_mei_fem['valor'] = df_mei_fem['valor'].astype('float')


df_mei_mas = pd.read_csv('Data/saldo_credito_mei_masculino.csv', sep=';',encoding='latin1')
df_mei_mas['data'] = pd.to_datetime(pd.Series(df_mei_mas['data']), format="%d/%m/%Y")
df_mei_mas['data_quarter'] = df_mei_mas['data']
df_mei_mas['data_quarter'] = df_mei_mas['data_quarter'].dt.to_period("Q")
df_mei_mas = df_mei_mas.set_index('data')
df_mei_mas['valor'] = df_mei_mas['valor'].apply(lambda x: x.replace(',','.'))
df_mei_mas['valor'] = df_mei_mas['valor'].astype('float')

df_join = df_mei_fem.copy()
df_join = df_join.drop(columns=['valor'])
df_join['valor_fem'] = df_mei_fem['valor']
df_join['valor_mas'] = df_mei_mas['valor']
df_join['valor_max'] = df_join[['valor_fem', 'valor_mas']].values.max(1)
df_join['valor_min'] = df_join[['valor_fem', 'valor_mas']].values.min(1)
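
The two loading steps above are identical apart from the file name, so they could be folded into one helper. The sketch below is only an illustration: `load_credit_csv` is a hypothetical name, the two-row sample is made up, and it relies on the `decimal=','` option of `pd.read_csv` instead of replacing commas by hand.

```python
import io
import pandas as pd

def load_credit_csv(source):
    """Load one credit-balance CSV into a DataFrame indexed by date."""
    df = pd.read_csv(
        source,
        sep=';',
        decimal=',',        # parse "1234,56" as 1234.56 while reading
        encoding='latin1',
    )
    df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
    df['data_quarter'] = df['data'].dt.to_period('Q')
    return df.set_index('data')

# Hypothetical two-row sample in the same layout as the real files
sample = io.StringIO("data;valor\n01/03/2019;1234,56\n01/06/2019;1300,10\n")
df_sample = load_credit_csv(sample)
print(df_sample)
```

With the real files, the same function would be called once with each CSV path.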
In [45]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df_mei_fem.index, 'valor', data=df_mei_fem, 
         markerfacecolor='blue', markersize=12, color='#394989', 
         label = 'Women')

ax.plot(df_mei_mas.index , 'valor', data=df_mei_mas, 
         markerfacecolor='red', markersize=12, color='#cf1b1b',
         label = 'Men')

ax.fill_between(df_mei_mas.index.tolist(), df_join['valor_min'], df_join['valor_max'], facecolor='#000000', alpha=0.1)

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
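# Place one tick per observation so each quarter label lines up with its data point
ax.set_xticks(df_mei_fem.index)
ax.set_xticklabels(df_mei_fem['data_quarter'].astype(str))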

ttl = ax.set_title('Difference between women\'s and men\'s small\n companies credit received by the federal government')
ttl.set_position([0.5, 0.95])


plt.setp(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('Quarter', fontweight='bold', labelpad=20)
plt.ylabel('Millions of R$ - reais', fontweight='bold', labelpad=20)
plt.legend()

plt.subplots_adjust(bottom=0.3)
plt.show()

The chart above shows the large difference in the credit granted to each gender. It represents how much more financial support men entrepreneurs receive than women entrepreneurs. It is therefore expected that men generate even more income, since they receive more credit. This chart alone isn't enough to prove the existence of gender disparity in Brazil, but it gives a clear view of it.
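
To put a number on the difference, the gap can be expressed both in absolute terms and as a ratio. The sketch below uses made-up balances, not the real series; only the column names `valor_fem` and `valor_mas` follow the notebook, while `df_gap`, `abs_gap` and `ratio` are names introduced here.

```python
import pandas as pd

# Hypothetical balances in millions of R$, for illustration only
df_gap = pd.DataFrame(
    {'valor_fem': [100.0, 120.0], 'valor_mas': [150.0, 210.0]},
    index=['2019Q1', '2019Q2'],
)

# Absolute gap and how many times larger the men's balance is
df_gap['abs_gap'] = df_gap['valor_mas'] - df_gap['valor_fem']
df_gap['ratio'] = df_gap['valor_mas'] / df_gap['valor_fem']
print(df_gap)
```

Applied to `df_join`, the same two lines would show whether the gap is widening or narrowing from quarter to quarter.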

Part 3 - Analysing the difference in grades between men and women in the national university entrance exam

Understanding the grade disparity between young women and men may help explain why we see many more opportunities for men than for women. To do this, we use a national exam, ENEM, which is comparable to the SAT in the United States.

In [117]:
dados_enem = pd.read_csv('Data/microdados_enem_2019/DADOS/MICRODADOS_ENEM_2019.csv', 
                         sep=';',encoding='latin1')  
dados_enem.head()
Out[117]:
NU_INSCRICAO NU_ANO CO_MUNICIPIO_RESIDENCIA NO_MUNICIPIO_RESIDENCIA CO_UF_RESIDENCIA SG_UF_RESIDENCIA NU_IDADE TP_SEXO TP_ESTADO_CIVIL TP_COR_RACA ... Q016 Q017 Q018 Q019 Q020 Q021 Q022 Q023 Q024 Q025
0 190001004627 2019 1506807 Santarém 15 PA 21 M 1 3 ... A A A C B A D A B A
1 190001004628 2019 1504059 Mãe do Rio 15 PA 16 F 1 3 ... A A A B B A B A A A
2 190001004629 2019 1505502 Paragominas 15 PA 18 F 1 1 ... B A A D B B D A C B
3 190001004630 2019 1507706 São Sebastião da Boa Vista 15 PA 23 M 0 3 ... A A A C A A D A A A
4 190001004631 2019 1503903 Juruti 15 PA 23 M 1 3 ... A A A B A A D A A A

5 rows × 136 columns

The test - ENEM - is divided into 5 areas:

  • Math
  • Physics, Biology and Chemistry
  • Portuguese
  • History and Geography
  • Essay

The idea is to calculate the average grade by gender and state. The resulting charts will give us a closer view of which states have more disparity and which have less.

It is important to highlight that the grades in the ENEM are within a range of 0 to 1000.

In [171]:
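# Select the grade columns before averaging; recent pandas versions raise
# an error when mean() is applied to the non-numeric columns
nota_cols = ['NU_NOTA_CN', 'NU_NOTA_CH', 'NU_NOTA_LC', 'NU_NOTA_MT', 'NU_NOTA_REDACAO']
dados_enem_g = dados_enem.groupby(['SG_UF_RESIDENCIA', 'TP_SEXO'])[nota_cols].mean()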
dados_enem_g = dados_enem_g.reset_index()
dados_enem_g = dados_enem_g.rename(columns={"SG_UF_RESIDENCIA": "estado"})
dados_enem_g = dados_enem_g.set_index('estado')
dados_enem_g.head()
Out[171]:
TP_SEXO NU_NOTA_CN NU_NOTA_CH NU_NOTA_LC NU_NOTA_MT NU_NOTA_REDACAO
estado
AC F 443.544392 476.739907 500.369567 468.891653 537.773165
AC M 462.438381 487.122300 499.447489 501.590544 525.403072
AL F 450.874183 479.357213 500.879192 483.917513 558.253807
AL M 470.085167 494.751326 505.708799 523.387709 551.491121
AM F 444.126692 477.532589 499.359327 468.730168 513.152105
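
As a quick check of how large the gender gap is, the rows printed above for AC and AL can be turned into one mean grade per gender and then subtracted. This is a minimal sketch: the numbers are copied from the output above, while `df_states`, `by_gender` and `gap_m_minus_f` are names introduced here for illustration.

```python
import pandas as pd

areas = ['NU_NOTA_CN', 'NU_NOTA_CH', 'NU_NOTA_LC', 'NU_NOTA_MT', 'NU_NOTA_REDACAO']

# First four rows of the output above (AC and AL only)
rows = [
    ('AC', 'F', 443.544392, 476.739907, 500.369567, 468.891653, 537.773165),
    ('AC', 'M', 462.438381, 487.122300, 499.447489, 501.590544, 525.403072),
    ('AL', 'F', 450.874183, 479.357213, 500.879192, 483.917513, 558.253807),
    ('AL', 'M', 470.085167, 494.751326, 505.708799, 523.387709, 551.491121),
]
df_states = pd.DataFrame(rows, columns=['estado', 'TP_SEXO'] + areas)

# Unweighted mean of the five areas, then one column per gender
df_states['mean_grade'] = df_states[areas].mean(axis=1)
by_gender = df_states.pivot(index='estado', columns='TP_SEXO', values='mean_grade')

# Positive values mean men scored higher on average
by_gender['gap_m_minus_f'] = by_gender['M'] - by_gender['F']
print(by_gender.round(2))
```

For these two states the gap comes out at roughly 10 and 14 points on the 0 to 1000 scale.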
In [119]:
# Use a context manager so the file is closed after loading
with open('Data/EstadosBrasileiros/br_states.json') as f:
    geo_json_data = json.load(f)

Plot the average grade by state in Brazil

In [178]:
def catergoryMap(df, title):
    colormap = linear.YlOrRd_09.scale(df.min(),df.max())
        
    mapa = folium.Map(
        width=600, height=400,
        location=[-15.77972, -47.92972], 
        zoom_start=3.5
    )
    
    folium.GeoJson(
        geo_json_data,
        name='2019',
        style_function=lambda feature: {
            'fillColor': colormap(df[feature['id']]),
            'color': 'black',
            'weight': 0.3,
        }

    ).add_to(mapa)
    colormap.caption = title
    colormap.add_to(mapa)
    
    folium.LayerControl(collapsed=False).add_to(mapa)

    return mapa
In [176]:
## Women
df_w_mean_grade = dados_enem_g.copy()
df_w_mean_grade = df_w_mean_grade[df_w_mean_grade['TP_SEXO'] == 'F']
col = df_w_mean_grade.loc[: , "NU_NOTA_CN":"NU_NOTA_REDACAO"]
df_w_mean_grade['mean_grade'] = col.mean(axis=1)

## Men
df_m_mean_grade = dados_enem_g.copy()
df_m_mean_grade = df_m_mean_grade[df_m_mean_grade['TP_SEXO'] == 'M']
col = df_m_mean_grade.loc[: , "NU_NOTA_CN":"NU_NOTA_REDACAO"]
df_m_mean_grade['mean_grade'] = col.mean(axis=1)

Grade average per state and gender

Women

In [180]:
catergoryMap(df_w_mean_grade['mean_grade'], 'Women mean grade per state')
Out[180]:
[Interactive folium map: women's mean grade per state]

Men

In [181]:
catergoryMap(df_m_mean_grade['mean_grade'], 'Men mean grade per state')
Out[181]:
[Interactive folium map: men's mean grade per state]
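
Each map above scales its colors to its own minimum and maximum, so the same color can stand for different grades on the two maps. One possible way to make them directly comparable is to compute a single pair of bounds for both series. The sketch below assumes two states only, using the mean of the five area averages printed earlier for AC and AL, rounded to two decimals; `women`, `men`, `vmin` and `vmax` are names introduced here.

```python
import pandas as pd

# Mean of the five area averages for two states, rounded
women = pd.Series({'AC': 485.46, 'AL': 494.66})
men = pd.Series({'AC': 495.20, 'AL': 509.08})

# One pair of bounds covering both series
vmin = min(women.min(), men.min())
vmax = max(women.max(), men.max())
print(vmin, vmax)

# These bounds could then replace df.min() and df.max() in the
# linear.YlOrRd_09.scale(...) call, so both maps share one legend.
```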

Conclusions obtained from the charts above

Since ENEM is the national exam used for university admission, it can be used to check whether men receive more encouragement than women while still in school. The maps above show that the grade difference between women and men is small in several Brazilian states. Therefore, if men do receive more encouragement to study and to do well at university than women, this is not reflected in the exam results, because the two averages are almost the same. Yet the first chart in this research showed that men entrepreneurs receive much more financial credit than women entrepreneurs.

That leaves us with one crucial question. Since the typical entrepreneur is 42 years old, as some studies show, and the national exam is taken by students between 18 and 20 years old, why is the difference between women and men so much larger in financial credit than in average grades? The question matters because one could argue that the credit gap comes from women being less educated than men, since education directly affects a person's long-term income in our current society. However, the grade gap is not large enough to support that argument, which leads us to a narrower hypothesis: the difference may grow between the ages of 18 and 42 because of some other factor, such as household responsibilities.

Brazil has a sexist culture, and this has serious consequences. For example, women are often expected by their families and husbands to take care of the home, while men are responsible for providing money for the family. As a result, men have many more opportunities than women to start businesses and, consequently, to receive credit and earn a higher income in the long term.

Part 4 - Checking the assumption that women spend more time at home than men

This analysis uses the following data from the Brazilian Institute of Geography and Statistics (IBGE):

In [287]:
# sep and encoding only apply to text files, so they are not passed to read_excel
df_ibge = pd.read_excel('Data/1_Estruturas_Economicas/Tabela 1.xls',
                        skiprows=5)
df_ibge = df_ibge[['Unnamed: 0', 'Média.4', 'CV (%).4', 'Média.5', 'CV (%).5']]
df_ibge = df_ibge.rename(columns={
                "Média.4": "Men_mean", 
                "CV (%).4": "CV (%).M",
                "Média.5": "Women_mean", 
                "CV (%).5": "CV (%).W",
                "Unnamed: 0": "age",
        })
df_ibge = df_ibge.set_index('age')
df_ibge = df_ibge.drop(df_ibge.tail(5).index)
df_ibge = df_ibge.tail(4)
df_ibge = df_ibge.reset_index()
df_ibge.iloc[0, df_ibge.columns.get_loc('age')] = '14 to 29'
df_ibge.iloc[1, df_ibge.columns.get_loc('age')] = '30 to 49'
df_ibge.iloc[2, df_ibge.columns.get_loc('age')] = '50 to 59'
df_ibge.iloc[3, df_ibge.columns.get_loc('age')] = '60 or more'
df_ibge = df_ibge.set_index('age')
df_ibge.head()
Out[287]:
Men_mean CV (%).M Women_mean CV (%).W
age
14 to 29 9.608445 0.932617 15.754179 1.110726
30 to 49 10.854688 0.699839 18.793720 0.610449
50 to 59 10.491923 0.948139 19.156483 0.804151
60 or more 10.814891 1.340377 19.345568 1.265817
In [378]:
class draggable_lines:
    def __init__(self, ax, XorY, df):
        self.ax = ax
        self.c = ax.get_figure().canvas
        self.XorY = XorY
        self.df = df
        
        x = [-100, 100]
        y = [XorY, XorY]

        self.line = lines.Line2D(x, y, picker=5)
        self.ax.add_line(self.line)
        self.c.draw_idle()
        self.sid = self.c.mpl_connect('pick_event', self.clickonline)

    def clickonline(self, event):
        # First click on the line: start following the mouse
        if event.artist == self.line:
            self.follower = self.c.mpl_connect("motion_notify_event", self.followmouse)
            self.releaser = self.c.mpl_connect("button_press_event", self.releaseonclick)

    def followmouse(self, event):
        # event.ydata is None while the cursor is outside the axes
        if event.ydata is None:
            return
        self.line.set_ydata([event.ydata, event.ydata])
        self.c.draw_idle()

    def releaseonclick(self, event):
        # Second click: drop the line and stop tracking the mouse
        self.XorY = self.line.get_ydata()[0]
        self.c.mpl_disconnect(self.releaser)
        self.c.mpl_disconnect(self.follower)

        # ax.patches holds the bars in drawing order:
        # the men's bars first, then the women's bars
        bars = self.ax.patches
        n_groups = len(self.df)

        for i, (_, row) in enumerate(self.df.iterrows()):
            men_top = row['Men_mean'] + row['CV (%).M']
            women_top = row['Women_mean'] + row['CV (%).W']
            # Recolor each bar depending on whether it ends below the line
            bars[i].set_color('#3c2946' if men_top < self.XorY else '#050505')
            bars[i + n_groups].set_color('#cf1b1b' if women_top < self.XorY else '#ff847c')

        self.c.draw_idle()

fig, ax = plt.subplots(1,1)

width = 0.4  # the width of the bars
ind = np.arange(4)   # the x locations for the groups

ax.bar(ind+width/2, df_ibge['Men_mean'], yerr=df_ibge['CV (%).M'],
       alpha=0.5, width=width, capsize=3, color='#050505', edgecolor = "black",
      linewidth=1.5, label='Men')
ax.bar(ind+width*3/2, df_ibge['Women_mean'], yerr=df_ibge['CV (%).W'],
       alpha=0.5, width=width, capsize=3, color='#ff847c', edgecolor = "black",
      linewidth=1.5, label='Women')
ax.set_xticks(ind + width)
ax.set_xticklabels((df_ibge.index))
ax.legend()
horizontal_line = draggable_lines(ax, df_ibge['Men_mean'].iloc[0], df_ibge)

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
ax.set_ylabel('Hours per day working\nwith domestic activities')
ax.set_xlabel('Age')

plt.show()

Conclusion

With the interactive bar chart above, we can check our assumption. As can be seen, women of all ages spend more time on domestic activities than men. As a consequence, men can spend much more time working than women. In the long term, it is reasonable to expect that women will earn lower wages because of this lack of opportunity. Even though women have almost the same average grade as men on the national exam in their teenage years, men will inevitably have better job opportunities if women need to spend about 8 hours more than men on chores at home.
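
The size of the gap can be read directly from the table printed earlier. The sketch below copies those values and subtracts the two means per age group. It also converts the CV columns into absolute spreads, under the assumption that they are coefficients of variation in percent, as the column names suggest; `df_hours`, `gap`, `men_spread` and `women_spread` are names introduced here.

```python
import pandas as pd

# Values copied from the df_ibge output shown earlier
df_hours = pd.DataFrame(
    {
        'Men_mean':   [9.608445, 10.854688, 10.491923, 10.814891],
        'CV (%).M':   [0.932617, 0.699839, 0.948139, 1.340377],
        'Women_mean': [15.754179, 18.793720, 19.156483, 19.345568],
        'CV (%).W':   [1.110726, 0.610449, 0.804151, 1.265817],
    },
    index=['14 to 29', '30 to 49', '50 to 59', '60 or more'],
)

# Extra hours women spend on domestic activities, per age group
df_hours['gap'] = df_hours['Women_mean'] - df_hours['Men_mean']

# Assumption: a CV in percent converts to an absolute spread as mean * CV / 100
df_hours['men_spread'] = df_hours['Men_mean'] * df_hours['CV (%).M'] / 100
df_hours['women_spread'] = df_hours['Women_mean'] * df_hours['CV (%).W'] / 100

print(df_hours[['gap', 'men_spread', 'women_spread']].round(2))
```

The gap is a little over 6 hours for the youngest group and between roughly 8 and 9 hours for the others. If the CV assumption holds, the absolute spreads are well under half an hour, so the error bars drawn from the raw CV values would overstate the uncertainty.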

Final observations

The ultimate goal of this research was to show the large disparity between women and men in Brazil, since many people still believe that it doesn't exist. I hope to continue the studies made here to obtain better and more accurate results, raise awareness, and help reduce those disparities for a better future.